Exploring Equity Classifications with Machine Learning (Census Tracts)

This is a continuation of the initial study on county subdivisions in the MPO. Given the relative lack of variation at the county subdivision level, this extension applies the methodology used previously to census tracts to take advantage of their higher levels of variation. Smaller geographies taken as a group tend to vary more because less aggregation masks the variation. The Boston region has been residentially segregated by race, historically and in the present, and that segregation shows more accurately in smaller geographies. For example, in the majority-minority City of Boston, much of the Black population lives in the neighborhoods of Roxbury, Dorchester, and Mattapan, while many other neighborhoods are majority white, and the suburbs of Boston mostly remain heavily majority white. Looking at different levels of geography allows exploration of patterns in the demographic data that might not appear when aggregated to the county subdivision level.

https://www.bostonmagazine.com/news/2020/12/08/boston-segregation/

https://www.tbf.org/-/media/tbf/reports-and-covers/2019/gbhrc-chapters/gbhrc19-chapter-3--segregation.pdf

Conclusion:

While the purpose of this study was to see whether the greater variation in the demographic data at the tract level would reveal clustering, it does not appear to. The overall pattern is one large cluster made up of a sizeable range of data spread out over a gradient rather than grouped into clusters, plus some outlying features that DBSCAN categorizes as noise.

There are many other unsupervised machine learning algorithms that could be applied to this data to explore other results, but after exploring the data through this study, I would not suggest it. In general, we would apply other algorithms to capture clusters whose features require different ways of conceptualizing them than the algorithms already used. Even though nine-dimensional data is hard to capture visually, the visualizations of the results do not suggest multiple clusters, no matter the algorithm.

Something to note is that this does not preclude unsupervised machine learning from being a useful tool for exploring patterns in demographic data, just not at this scale and with these particular variables. While useful clusters were not found, the pattern that was uncovered was the gradient of values across all the variables used. This confirms something we have known for a while: using thresholds to define 'Equity' areas cannot capture the true patterns of the data.

Demographic Variables: Tables, Fields, and Calculation Notes

Race/Ethnicity
  Tables: ACS14 B03002
  Calculation: B03002_001 - B03002_003
  Notes: Total - White Alone

Limited English Proficiency
  Tables: ACS14 B16001
  Calculation: (B16001_005 + B16001_008 + B16001_011 + B16001_014 + B16001_017 + B16001_020 + B16001_023 + B16001_026 + B16001_029 + B16001_032 + B16001_035 + B16001_038 + B16001_041 + B16001_044 + B16001_047 + B16001_050 + B16001_053 + B16001_056 + B16001_059 + B16001_062 + B16001_065 + B16001_068 + B16001_071 + B16001_074 + B16001_077 + B16001_080 + B16001_083 + B16001_086 + B16001_089 + B16001_092 + B16001_095 + B16001_098 + B16001_101 + B16001_104 + B16001_107 + B16001_110 + B16001_113 + B16001_116 + B16001_119) / (Total Population - (B01001_003 + B01001_027))
  Notes: C16001 ("speaks English less than very well" fields) divided by the population age 5 and over

Median Income
  Tables: ACS14 B19013
  Calculation: B19013_001

% of HH with income below 200% of poverty line
  Tables: ACS14 C17002
  Calculation: C17002_002E + C17002_003E + C17002_004E + C17002_005E + C17002_006E + C17002_007E

Low Income Households
  Tables: ACS14 B19001, B19025, B11001
  Fields: all of B19001, B19025_001, B11001_001
  Notes: HH income ranges, aggregate HH income, total HH

No Car Households
  Tables: ACS14 B08201
  Calculation: B08201_002
  Notes: HH with no vehicles available

Population Density
  Source: https://jtleider.github.io/censusdata/api.html
  Calculation: B01001_001 / AREA (Total Pop / AREA)
  Notes: AREA comes from the tract shapes; also uses total population data

Children
  Tables: ACS14 B01001, 2010 Census P12
  Calculation: (B01001_003 + B01001_004 + B01001_005 + B01001_006 + B01001_027 + B01001_028 + B01001_029 + B01001_030), (P012_003 + P012_004 + P012_005 + P012_006 + P012_027 + P012_028 + P012_029 + P012_030)
  Notes: boys under 18 plus girls under 18 (ages 0-17)

Population Over 5
  Tables: ACS14 B01001, 2010 Census P12
  Calculation: B01001_001 - (B01001_003 + B01001_027), P012_001 - (P012_003 + P012_027)
  Notes: total population minus children under 5 (ages 5+)

Seniors
  Tables: ACS14 B01001
  Calculation: (B01001_020 + B01001_021 + B01001_022 + B01001_023 + B01001_024 + B01001_025) + (B01001_044 + B01001_045 + B01001_046 + B01001_047 + B01001_048 + B01001_049)
  Notes: men ages 65+ plus women ages 65+

People with Disabilities
  Tables: ACS14 S1810
  Calculation: S1810_C02_001E / S1810_C01_001E
  Notes: total population with a disability / total noninstitutionalized population; includes ambulatory, hearing, vision, self-care, cognitive, and independent living difficulties

Total Population
  Tables: ACS14 B01001, 2010 Census P1
  Calculation: B01001_001, P01_001
  Notes: includes those housed in group quarters

Sources:

Finding Tables and Fields Resources

https://www.census.gov/prod/cen2010/doc/sf1.pdf

https://www.census.gov/programs-surveys/acs/technical-documentation/table-shells.2014.html

Grab ACS Data, Brief Clean, and Sum if Necessary
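The "sum if necessary" step can be sketched as below. This is a minimal sketch with hypothetical toy values standing in for downloaded ACS 2010-2014 tract estimates (the real values would come from the Census API, e.g. via the censusdata package); the field names follow the Seniors calculation from the table above.

```python
import pandas as pd

# Toy stand-in for downloaded ACS 2010-2014 tract estimates.
acs = pd.DataFrame({
    "tract": ["000101", "000102"],
    # Men 65+ and women 65+ component fields from table B01001
    "B01001_020E": [10, 5], "B01001_021E": [8, 4], "B01001_022E": [6, 3],
    "B01001_023E": [4, 2], "B01001_024E": [2, 1], "B01001_025E": [1, 0],
    "B01001_044E": [12, 6], "B01001_045E": [9, 5], "B01001_046E": [7, 4],
    "B01001_047E": [5, 3], "B01001_048E": [3, 2], "B01001_049E": [2, 1],
})

# Men ages 65+ (fields 020-025) plus women ages 65+ (fields 044-049)
senior_cols = [f"B01001_{i:03d}E" for i in list(range(20, 26)) + list(range(44, 50))]
acs["seniors"] = acs[senior_cols].sum(axis=1)
```

The same pattern (select component fields, sum row-wise into one derived column) applies to the Children, Limited English Proficiency, and poverty calculations.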

Machine Learning Section

Charts of Variables

Make graphs of all the variables to see what the range looks like
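A sketch of how the per-variable range charts can be produced, assuming the cleaned data sits in a pandas DataFrame (the column names and random values here are placeholders, not the study's actual data):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for script use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Placeholder stand-in for three of the nine demographic variables
df = pd.DataFrame(rng.random((50, 3)),
                  columns=["minority_pct", "lep_pct", "pop_density"])

# One histogram per variable to inspect each range and shape
fig, axes = plt.subplots(1, len(df.columns), figsize=(12, 3))
for ax, col in zip(axes, df.columns):
    ax.hist(df[col], bins=10)
    ax.set_title(col)
fig.tight_layout()
```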

For a combined view without scrolling, see newplot2-2.png.

Use Silhouette Analysis to Determine Optimal Number of Clusters for K-Means

https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#sphx-glr-auto-examples-cluster-plot-kmeans-silhouette-analysis-py

Silhouette analysis is also used below to test the fit of the DBSCAN model's clusters to the data.
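The silhouette selection loop can be sketched as follows, assuming a scaled feature matrix X (random placeholder data here, not the study's tracts):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.random((100, 9))  # placeholder for the nine scaled variables

# Average silhouette score for each candidate k; higher is better
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```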

Make Elbow Diagram for K-Means (Optimizing the Number of Clusters)

https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/
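The elbow diagram plots K-means inertia (within-cluster sum of squares) against k; a sketch, again on placeholder data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((100, 9))  # placeholder for the nine scaled variables

# Inertia always decreases as k grows; the "elbow" is where the drop flattens
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 9)]
```

Plotting `inertias` against `range(1, 9)` gives the elbow diagram.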

Try K-Means

(https://static1.squarespace.com/static/5ff2adbe3fe4fe33db902812/t/6062a083acbfe82c7195b27d/1617076404560/ISLR%2BSeventh%2BPrinting.pdf)
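A minimal K-means fit, with standardization first since the variables are on very different scales (income vs. percentages); the data here is a random stand-in:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Random stand-in; one column inflated to mimic a dollar-scale variable
X_raw = rng.random((120, 9)) * [1, 1, 1, 1, 1, 100, 1, 1, 1]
X = StandardScaler().fit_transform(X_raw)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_  # cluster assignment per tract
```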

Summary: K Means

Essentially, we do get three clusters, and they are interesting.

What the above parallel coordinates plot shows is a visualization of the clusters based on the data for each tract. As you can see, Pink (1) and Yellow (2) are pretty similar, but Blue (0) looks like a combination of outliers and data following a different pattern. Something that DBSCAN does naturally is not assign outliers to clusters. See below.
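A parallel coordinates plot like the one above can be drawn with pandas; this sketch uses placeholder columns and random data rather than the study's variables:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Placeholder variables; each row stands in for one tract
df = pd.DataFrame(rng.random((60, 4)), columns=["v1", "v2", "v3", "v4"])
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(df)

# One line per tract, colored by cluster label
ax = parallel_coordinates(df, class_column="cluster", colormap="viridis")
```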

Try DBSCAN

The quality of the clusters is low, as seen from the silhouette score. However, we may as well visualize the results.

As you can see below, there is significant overlap between the two clusters and the noise. While there are many more dimensions that could be explored, the silhouette score's proximity to zero means the clusters likely do overlap to some extent (in nine dimensions), which defeats the purpose of having multiple clusters in the first place.
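A DBSCAN sketch on placeholder data; note that the label -1 marks noise points, which is the behavior described above, and the eps/min_samples values here are illustrative, not tuned for the actual tract data:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.random((150, 9))  # placeholder for the nine scaled variables

db = DBSCAN(eps=0.9, min_samples=5).fit(X)
labels = db.labels_  # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

# Silhouette score is only defined for two or more clusters,
# and is computed on the non-noise points
core = labels != -1
if n_clusters >= 2:
    score = silhouette_score(X[core], labels[core])
```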

Try Principal Component Analysis

As before, we use dimension reduction for visualization and possible functional improvements, looking at the relationships among the variables instead of the variable values themselves.

https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues

https://drive.google.com/file/d/1YdA-HHYP1V05QgvwLCvfnuuau67Zl38n/view

Above you can see the results of the PCA. Given that PC1-3 explain 79% of the variation, we will use just the first three PCs. Getting closer to 90% would retain more of the data than intended; the goal is not to replace the data wholesale with its relationships, but to choose a few substantial relationships and reduce the dimensions.
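The variance-explained check and the reduction to three components can be sketched as follows (random placeholder data, so the explained-variance numbers here will not match the 79% reported above):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.random((120, 9)))  # placeholder

# Full PCA to inspect cumulative explained variance per component
pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)

# Keep only the first three principal components
X3 = PCA(n_components=3).fit_transform(X)
```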

Try K Means Again!!

Here we use k = 3 because, while the elbow diagrams were inconclusive, the silhouette score was clear and still corresponds to the inertia elbow diagram. Below are the results plotted in 3D. The clusters all seem to be attached, which makes one question whether there should be more than one cluster at all.
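Re-running K-means on the three principal components and plotting the result in 3D can be sketched like this (placeholder data again; the PC scores do not correspond to the actual tracts):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers "3d" projection)
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.random((120, 9)))  # placeholder
pcs = PCA(n_components=3).fit_transform(X)

# K-means with k = 3 on the reduced data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(pcs)

# 3D scatter of the tracts in PC space, colored by cluster
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(pcs[:, 0], pcs[:, 1], pcs[:, 2], c=labels)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
```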

The parallel coordinates plot does show how the K-means clusters relate to their own and other features (tracts). However, the pattern each cluster is centered around overlaps another cluster for a large number of features. This again shows a gradient pattern rather than a cluster pattern. To confirm that, and the theory that there is still only one major cluster with noise, we run DBSCAN again.

Try DBSCAN Again!

Yet again, a good fit with a bad result (i.e., one cluster).